Abstract

Successful deployment of cognitive radios requires efficient sensing of the spectrum and dynamic
adaptation of the available resources according to the sensed (imperfect) information. While most works
design these two tasks separately, in this paper we address them jointly. In particular, we investigate an
overlay cognitive radio with multiple secondary users that access orthogonally a set of frequency bands
originally devoted to primary users. The schemes are designed to minimize the cost of sensing, maximize the
performance of the secondary users (weighted sum rate), and limit the probability of interfering the primary
users. The joint design is addressed using dynamic programming and nonlinear optimization techniques. A
two-step strategy that first finds the optimal resource allocation for any sensing scheme and then uses that
solution as input to solve for the optimal sensing policy is implemented. The two-step strategy is optimal,
gives rise to intuitive optimal policies, and entails a computational complexity much lower than that required
to solve the original formulation.

Index Terms

Cognitive radios, sequential decision making, dual decomposition, partially observable Markov decision 
processes 

I. Introduction

Cognitive radios (CRs) are viewed as the next-generation solution to alleviate the perceived
spectrum scarcity. When CRs are deployed, the secondary users (SUs) have to sense their radio
environment to optimize their communication performance while avoiding (limiting) the interference 
to the primary users (PUs). As a result, effective operation of CRs requires the implementation of 
two critical tasks: i) sensing the spectrum and ii) dynamic adaptation of the available resources 
according to the sensed information [10]. To carry out the sensing task two important challenges
are: C1) the presence of errors in the measurements that lead to errors on the channel occupancy
detection and thus render harmless SU transmissions impossible; and C2) the inability to sense the 
totality of the time-frequency lattice due to scarcity of resources (time, energy, or sensing devices). 
Two additional challenges that arise to carry out the resource allocation (RA) task are: C3) the 
need of the RA algorithms to deal with channel imperfections; and C4) the selection of metrics that 
properly quantify the reward for the SUs and the damage for the PUs. 

Many alternatives have been proposed in the CR literature to deal with these challenges. Different 
forms of imperfect channel state information (CSI), such as quantized or noisy CSI, have been used 
to deal with C1 [20]. However, in the context of CR, fewer works have considered the fact that
the CSI may be not only noisy but also outdated, or have incorporated those imperfections into 
the design of RA algorithms H. The inherent trade-off between sensing cost and throughput gains 
in C2 has been investigated [14], and designs that account for it based on convex optimization
[24] and dynamic programming (DP) [6] for specific system setups have been proposed. Regarding
C3, many works consider that the CSI is imperfect, but only a few exploit the statistical model
of these imperfections (especially for the time correlation) to mitigate them; see, e.g., [6], [17].
Finally, different alternatives have been considered to deal with C4 and limit the harm that the SUs
cause to the PUs [9]. The most widely used is to set limits on the peak (instantaneous) and average
interfering power. Some works also have imposed limits on the rate loss that PUs experience [18],
[15], while others look at limiting the instantaneous or average probability of interfering the PU
(bounds on the short-term or long-term outage probability) [22], [17].

Regardless of the challenges addressed and the formulation chosen, the sensing and RA policies 
have been traditionally designed separately. Each of the tasks has been investigated thoroughly 
and relevant results are available in the literature. However, a globally optimum design requires 
designing those tasks jointly, so that the interactions among them can be properly exploited. Clearly, 
more accurate sensing enables more efficient RA, but at the expense of higher time and/or energy 
consumption. Early works dealing with joint design of sensing and RA are [28] and [6]. In such
works, imperfections in the sensors, and also time correlation of the state of the primary channel,
are considered. As a result, the sensing design is modeled as a partially observable Markov decision
process (POMDP) [4], which can be viewed as a specific class of DP. The design of the RA in
these works amounts to selecting the user transmitting on each channel (also known as user scheduling).
Under mild conditions, the authors establish that a separation principle holds in the design of the
optimal access and sensing policies. Additional works addressing the joint design of sensing and
RA, and considering more complex operating conditions, have been published recently [24], [12].
For a single SU operating multiple fading channels, [24] relies on convex optimization to optimally
design both the RA and the indexes of the channels to be sensed at every time instant. Assuming
that the number of channels that can be sensed at every instant is fixed and that the primary activity
is independent across time, the author establishes that the channels to sense are the ones that can
potentially yield a higher reward for the secondary user. Joint optimal design is also pursued in
[12], although for a very different setup. Specifically, [12] postulates that at each slot, the CR must
calculate the fraction of time devoted to sensing the channel and the fraction devoted to transmitting in
the bands which are found to be unoccupied. Clearly, a trade-off between sensing accuracy and
transmission rate emerges. The design is formulated as an optimal stopping problem, and solved by
means of Lagrange relaxation of DP [5]. However, neither of these two works takes into account the
temporal correlation of the state information of the primary network (SIPN).

The objective of this work is to design the sensing and the RA policies jointly while accounting 
for the challenges C1-C4. The specific operating conditions considered in the paper are described 
next. We analyze an overlay¹ CR with multiple SUs and PUs. SUs are able to adapt their power
and rate loadings and access orthogonally a set of frequency bands. Those bands are originally
devoted to PU transmissions. Orthogonally here means that if a SU is transmitting, no other SU
can be active in the same band. The schemes are designed to maximize the sum-average rate of 
the SUs while adhering to constraints that limit the maximum "average power" that SUs transmit 
and the average "probability of interfering" the PUs. It is assumed that the CSI of the SU links is 
instantaneous and free of errors, while the CSI of the PUs activity is outdated and noisy. A simple 
first-order hidden Markov model is used to characterize such imperfections. Sensing a channel band 
entails a given cost, and at each instant the system has to decide which channels (if any) are sensed. 

The jointly optimal sensing and RA schemes will be designed using DP and nonlinear optimization 
techniques. DP techniques are required because the activity of PUs is assumed to be correlated across 
time, so that sensing a channel has an impact not only for the current instant, but also for future 
time instants [28]. To solve the joint design, a two-step strategy is implemented. In the first step,
the sensing is considered given and the optimal RA is found for any fixed sensing scheme. This
problem was recently solved in [19], [17]. In the second step, the results of the first step are used as
input to obtain the optimal sensing policy. The motivation for using this two-step strategy is twofold.
First, while the joint design is nonconvex and has to be solved using DP techniques, the problem in
the first step (optimal RA for a fixed sensing scheme) can be recast as a convex one. Second, when
the optimal RA is substituted back into the original joint design, the resulting problem (which does
need to be solved using DP techniques) has a more favorable structure. More specifically, while the
original design problem was a constrained DP, the updated one is an unconstrained DP problem
which can be solved separately for each of the channels.

¹Some authors refer to overlay networks as interweave networks; see, e.g., [8].

The rest of the paper is organized as follows. Sec. II describes the system setup and introduces 
notation. The optimization problem that gives rise to the optimal sensing and RA schemes is 
formulated in Sec. III. The solution for the optimal RA given the sensing scheme is presented
in Sec. IV. The optimization of the sensing scheme is addressed in Sec. V. The section begins
with a brief review of DP and POMDPs. Then, the problem is formulated in the context of DP
and its solution is developed. Numerical simulations validating the theoretical claims and providing
insights on our optimal schemes are presented in Sec. VI. Sec. VII analyzes the main properties
of our jointly optimal RA and sensing policies, provides insights on the operation of such policies,
and points out future lines of work.²

II. System setup and state information 

This section describes the basic setup of the system. We begin with the system setup and
the operation of the system (tasks that the system runs at every time slot).
Then, we explain in detail the model for the CSI, which will play a critical role in the problem 
formulation. The resources that SUs will adapt as a function of the CSI are described in the last 
part of the section. 

We consider a CR scenario with several PUs and SUs. The frequency band of interest (the 
portion of spectrum that is licensed to PUs, or the subset of this shared with the SUs, if not all) 
is divided into K frequency-flat orthogonal subchannels (indexed by k). Each of the M secondary 
users (indexed by m) opportunistically accesses any number of these channels during a time slot 
(indexed by n). Opportunistic here means that the user accessing each channel will vary with time, 
with the objective of optimally utilizing the available channel resources. For simplicity, we assume
that there is a network controller (NC) which acts as a central scheduler and will also perform the
task of sensing the medium for primary presence. The scheduling information will be forwarded to
the mobile stations through a parallel feedback channel. The results hold for one-hop (either cellular
or any-to-any) setups.

²Notation: $x^*$ denotes the optimal value of variable $x$; $\mathbb{E}[\cdot]$ expectation; $\wedge$ the boolean "and" operator; $\mathbb{1}_{\{x\}}$ the indicator function
($\mathbb{1}_{\{x\}} = 1$ if $x$ is true and zero otherwise); and $[x]_+$ the projection of $x$ onto the non-negative orthant, i.e., $[x]_+ := \max\{x, 0\}$.

Next, we briefly describe the operation of the system. A more detailed description will be given 
in Sec. III, which will rely on the notation and problem formulation introduced in the following
sections. Before starting, it is important to clarify that we focus on systems where the SIPN is 
more difficult to acquire than the state information of the secondary network (SISN). As a result, 
we will assume that SISN is error-free and acquired at every slot n, while SIPN is not. With these 
considerations in mind, the CR operates as follows. At every slot n the following tasks are run 
sequentially: T1) the NC acquires the SISN; T2) the NC relies on the output of T1 (and on previous
measurements) to decide which channels to sense (if any), then the output of the sensing is used to
update the SIPN; and T3) the NC uses the outputs of T1 and T2 to find the optimal RA for instant
n. Overheads associated with acquisition of the SISN and notification of the optimal RA to the 
SUs are considered negligible. Such an assumption facilitates the analysis, and it is reasonable for 
scenarios where the SUs are deployed in a relatively small area which allows for low-cost signaling 
transmissions. 

A. State information and sensing scheme 

We begin by introducing the model for the SISN. The noise-normalized square magnitude of the 
fading coefficient (power gain) of the channel between the mth secondary user and its intended 
receiver on frequency k during slot n is denoted as $h_k^m[n]$. Channels are random, so that $h_k^m[n]$ is
a stochastic process, which is assumed to be independent across time. The values of $h_k^m[n]$ for all
m and k form the SISN at slot n. We assume that the SISN is perfect, so that the values of $h_k^m[n]$
at every time slot n are known perfectly (error-free). While SISN comprises the power gains of the
secondary links, the SIPN accounts for the channel occupancy. We will assume that the primary
system contains one user per channel. This assumption is made to simplify the analysis and it is
reasonable for certain primary systems, e.g., mobile telephony where a single narrow-band channel
is assigned to a single user during the course of a call. Since we consider an overlay scenario, it
suffices to know whether a given channel is occupied or not [8]. This way, when a PU is not active,
opportunities for SUs to transmit in the corresponding channel arise. The primary system is not
assumed to collaborate with the secondary system. Hence, from the point of view of the SUs, the
behavior of PUs is a stochastic process independent of $h_k^m[n]$. With these considerations in mind, the
presence of the primary user in channel k at time n is represented by the binary state variable $a_k[n]$
(0/idle, 1/busy). Each primary user's behavior will be modeled as a simple Gilbert-Elliot channel
model, so that $a_k[n]$ is assumed to remain constant during the whole time slot, and then change
according to a two-state, time-invariant Markov chain. The Markovian property will be useful to
keep the DP modeling simple and will also be exploited to recursively keep track of the SIPN.
Nonetheless, more advanced models can be considered without paying a big computational price
GUI, ED. With $P_k^{xy} := \Pr(a_k[n] = x \mid a_k[n-1] = y)$, the dynamics for the Gilbert-Elliot model are
fully described by the 2 x 2 Markov transition matrix $\mathbf{P}_k := [P_k^{00}, P_k^{01}; P_k^{10}, P_k^{11}]$. Sec. [VE] discusses
the implications of relaxing some of these assumptions.

While knowledge of $h_k^m[n]$ at instant n was assumed to be perfect (deterministic), knowledge of
$a_k[n]$ at instant n is assumed to be imperfect (probabilistic). Two important sources of imperfections
are: i) errors in the sensing process and ii) outdated information (because the channels are not always
sensed). For that purpose, let $s_k[n]$ denote a binary design variable which is 1 if the kth channel
is sensed at time n, and 0 otherwise. Moreover, let $z_k[n]$ denote the output of the sensor if indeed
$s_k[n] = 1$; i.e., if the kth channel has been sensed. We will assume that the output of the sensor is
binary and may contain errors. To account for asymmetric errors, the probabilities of false alarm
$P_k^{FA} = \Pr(z_k[n] = 1 \mid a_k[n] = 0)$ and miss detection $P_k^{MD} = \Pr(z_k[n] = 0 \mid a_k[n] = 1)$ are considered.
Clearly, the specific values of $P_k^{FA}$ and $P_k^{MD}$ will depend on the detection technique the sensors
implement (matched filter, energy detector, cyclostationary detector, etc.) and the working point
of the receiver operating characteristic (ROC) curve, which is usually controlled by selecting a
threshold [25]. In our model, this operation point is chosen beforehand and it is fixed during the
system operation, so that the values of $P_k^{FA}$ and $P_k^{MD}$ are assumed known. As already mentioned,
the sensing imperfections render the knowledge of $a_k[n]$ at instant n probabilistic. In other words,
$a_k[n]$ is a partially observable state variable. The knowledge about the value of $a_k[n]$ at instant n will
be referred to as (instantaneous) belief, also known as the information process. For a given instant
n, two different beliefs are considered: the pre-decision belief $B_k[n]$ and the post-decision belief
$B_k^S[n]$. Intuitively, $B_k[n]$ contains the information about $a_k[n]$ before the sensing decision has been
made (i.e., at the beginning of task T2), while $B_k^S[n]$ contains the information about $a_k[n]$ once $s_k[n]$
and $z_k[n]$ (if $s_k[n] = 1$) are known (i.e., at the end of task T2). Mathematically, if $\mathcal{H}_n$ represents
the history of all sensing decisions and measurements, i.e., $\mathcal{H}_n := \{s_k[0], z_k[0], \ldots, s_k[n], z_k[n]\}$,
then $B_k[n] := \Pr(a_k[n] = 1 \mid \mathcal{H}_{n-1})$ and $B_k^S[n] := \Pr(a_k[n] = 1 \mid \mathcal{H}_n)$. For notational convenience,
the beliefs will also be expressed as vectors, with $\mathbf{b}_k[n] := [1 - B_k[n], B_k[n]]^T$ and
$\mathbf{b}_k^S[n] := [1 - B_k^S[n], B_k^S[n]]^T$. Using basic results from Markov chain theory and provided that $\mathbf{P}_k$
(time-correlation model) is known, the expression to get the pre-decision belief at time slot n is

$$ \mathbf{b}_k[n] = \mathbf{P}_k \mathbf{b}_k^S[n-1]. \tag{1} $$

Differently, the expression to get $\mathbf{b}_k^S[n]$ depends on the sensing decision $s_k[n]$. If $s_k[n] = 0$, no
additional information is available, so that

$$ \mathbf{b}_k^S[n] = \mathbf{b}_k[n]. \tag{2} $$

If $s_k[n] = 1$, the belief is corrected as $\mathbf{b}_k^S[n] = \mathbf{b}_k^S(\mathbf{b}_k[n], z_k[n])$, with

$$ \mathbf{b}_k^S(\mathbf{b}_k[n], z) := \frac{\mathbf{D}_z \mathbf{b}_k[n]}{\Pr(z_k[n] = z \mid \mathbf{b}_k[n])}, \tag{3} $$

where $\mathbf{D}_z$ with $z \in \{0, 1\}$ is a 2 x 2 diagonal matrix with entries $[\mathbf{D}_z]_{1,1} := \Pr(z_k[n] = z \mid a_k = 0)$
and $[\mathbf{D}_z]_{2,2} := \Pr(z_k[n] = z \mid a_k = 1)$. Note that the denominator is the probability of an outcome
conditioned to a specified belief: $\Pr(z_k[n] = z \mid \mathbf{b}_k[n]) = \mathbf{1}^T \mathbf{D}_z \mathbf{b}_k[n]$, so that (3) corresponds to the
correction step of a Bayesian recursive estimator. If no information about the initial state of the PU
is available, the best choice is to initialize $\mathbf{b}_k[0]$ to the stationary distribution of the Markov chain
associated with channel k (i.e., the principal eigenvector of $\mathbf{P}_k$).
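
The belief recursion in (1)-(3) can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: the function names, the argument names (`p01`, `p10`, `p_fa`, `p_md`) and the numerical values are hypothetical and are not taken from the paper; the matrix is built column-stochastic so that it matches the definition $P_k^{xy} := \Pr(a_k[n] = x \mid a_k[n-1] = y)$.

```python
import numpy as np

def transition_matrix(p01, p10):
    """Column-stochastic matrix with entry [x, y] = Pr(a[n] = x | a[n-1] = y).

    p01 = Pr(idle now | busy before), p10 = Pr(busy now | idle before).
    (Argument names are assumptions made for this sketch.)
    """
    return np.array([[1.0 - p10, p01],
                     [p10, 1.0 - p01]])

def stationary_belief(P):
    """Principal eigenvector of P (eigenvalue 1), normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(P)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return v / v.sum()

def predict(P, b_post_prev):
    """Prediction step (1): pre-decision belief from the previous post-decision belief."""
    return P @ b_post_prev

def correct(b_pre, sensed, z, p_fa, p_md):
    """Post-decision belief: (2) if the channel was not sensed, (3) otherwise."""
    if not sensed:
        return b_pre.copy()
    # Diagonal of D_z: [Pr(z | a = 0), Pr(z | a = 1)]
    d = np.array([p_fa, 1.0 - p_md]) if z == 1 else np.array([1.0 - p_fa, p_md])
    unnormalized = d * b_pre
    # The normalizing constant equals 1^T D_z b, i.e. Pr(z | b)
    return unnormalized / unnormalized.sum()

# Example run with illustrative values
P = transition_matrix(p01=0.2, p10=0.1)
b0 = stationary_belief(P)            # belief vector [Pr(idle), Pr(busy)]
b_pre = predict(P, b0)               # step (1)
b_post = correct(b_pre, sensed=True, z=1, p_fa=0.1, p_md=0.05)  # step (3)
```

With these values a "busy" reading raises the busy probability well above its stationary value, while an unsensed slot leaves the pre-decision belief unchanged, as (2) prescribes.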

In a nutshell, the actual state of the primary and secondary networks is given by the random
processes $a_k[n]$ and $h_k^m[n]$, which are assumed to be independent. The operating conditions of our
CR are such that at instant n, the value of $h_k^m[n]$ is perfectly known, while the value of $a_k[n]$ is
not. As a result, the SIPN is not formed by $a_k[n]$, but by $\mathbf{b}_k[n]$ and $\mathbf{b}_k^S[n]$, which are a probabilistic
description of $a_k[n]$. The system will perform the sensing and RA tasks based on the available SISN
and SIPN. In particular, the sensing decision will be made based on $h_k^m[n]$ and $\mathbf{b}_k[n]$, while the RA
will be implemented based on $h_k^m[n]$ and $\mathbf{b}_k^S[n]$.

B. Resources at the secondary network 

We consider a secondary network where users are able to implement adaptive modulation and
power control, and share orthogonally the available channels. To describe the channel access scheme
(scheduling) rigorously, let $w_k^m[n]$ be a boolean variable so that $w_k^m[n] = 1$ if SU m accesses channel
k and zero otherwise. Moreover, let $p_k^m[n]$ be a nonnegative variable denoting the nominal power
SU m transmits in channel k, and let $C_k^m[n]$ be its corresponding rate. We say that $p_k^m[n]$ is
a nominal power in the sense that power is consumed only if the user is actually accessing the
channel. Otherwise the power is zero, so that the actual (effective) power user m loads in channel
k can be written as $w_k^m[n]\, p_k^m[n]$.

The transmission bit rate is obtained through Shannon's capacity formula [13]:
$C_k^m[n] := \log_2(1 + h_k^m[n]\, p_k^m[n] / \Gamma)$, where $\Gamma$ is a signal-to-noise ratio (SNR) gap that accounts for the difference between
the theoretical capacity and the actual rate achieved by the modulation and coding scheme the SU
implements. This is a bijective, nondecreasing, concave function of $p_k^m[n]$ and it establishes a
relationship between power and rate in the sense that controlling $p_k^m[n]$ implies also controlling
$C_k^m[n]$.

The fact of the access being orthogonal implies that, at any time instant, at most one SU can
access the channel. Mathematically,

$$ \sum_m w_k^m[n] \le 1, \quad \forall k, n. \tag{4} $$

Note that (4) allows for the event of all $w_k^m[n]$ being zero for a given channel k. That would happen
if, for example, the system thinks that it is very likely that channel k is occupied by a PU.
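
As a small illustration of the resources just described, the following sketch evaluates the rate-power relationship with an SNR gap and checks the orthogonal-access condition for one slot. The function names and the array layout (users along rows, channels along columns) are assumptions made for this example, not notation from the paper.

```python
import numpy as np

def rate(h, p, gap=1.0):
    """Rate log2(1 + h * p / gap) for power gain h, nominal power p and SNR gap."""
    return np.log2(1.0 + np.asarray(h, float) * np.asarray(p, float) / gap)

def is_orthogonal(w):
    """Orthogonal access: w has shape (M, K), is boolean, and has at most one SU per channel."""
    w = np.asarray(w)
    return bool(np.all((w == 0) | (w == 1)) and np.all(w.sum(axis=0) <= 1))

def effective_power(w, p):
    """Power actually loaded: the nominal power counts only where w = 1."""
    return np.asarray(w) * np.asarray(p, float)

# Two users, two channels: user 0 takes channel 0, channel 1 stays silent
w = np.array([[1, 0],
              [0, 0]])
p = np.array([[1.0, 2.0],
              [0.5, 0.5]])
print(is_orthogonal(w))        # True
print(effective_power(w, p))   # only entry (0, 0) is nonzero
```

Leaving a channel with all scheduling variables at zero is a valid outcome of the orthogonality condition, which is why the second column above passes the check.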



III. Problem statement

The approach in this paper is to design the sensing and RA schemes as the solution of a judiciously
formulated optimization problem. Consequently, it is critical to identify: i) the design (optimization)
variables, ii) the state variables, iii) the constraints that design and state variables must obey, and
iv) the objective of the optimization problem.

The first two steps were accomplished in the previous section, stating that the design variables
are $s_k[n]$, $w_k^m[n]$ and $p_k^m[n]$ (recall that there is no need to optimize over $C_k^m[n]$); and that the state
variables are $h_k^m[n]$ (SISN), and $\mathbf{b}_k[n]$ and $\mathbf{b}_k^S[n]$ (SIPN).

Moving to step iii), the constraints that the variables need to satisfy can be grouped into two
classes. The first class is formed by constraints that account for the system setup. This class includes
constraint (4) as well as the following constraints that were implicitly introduced in the previous
section: $s_k[n] \in \{0, 1\}$, $w_k^m[n] \in \{0, 1\}$ and $p_k^m[n] \ge 0$. The second class is formed by constraints
that account for quality of service (QoS). In particular, we consider the following two constraints.
The first one is a limit on the maximum average (long-term) power a SU can transmit. By enforcing
an average consumption constraint, opportunistic strategies are favored because energy can be saved
during deep fadings (or when the channel is known to be occupied) and used during transmission
opportunities. Transmission opportunities are time slots where the channel is certainly known to
be idle and the fading conditions are favorable. Mathematically, with $\check{p}^m$ denoting such maximum
value, the average power constraint is written as:

$$ \mathbb{E}\left[ \lim_{N \to \infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \sum_k w_k^m[n]\, p_k^m[n] \right] \le \check{p}^m, \quad \forall m, \tag{5} $$

where $0 < \gamma < 1$ is a discount factor such that more emphasis is placed on near-future instants. The
factor $(1 - \gamma)$ ensures that the averaging operator is normalized; i.e., that $\lim_{N \to \infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n = 1$.
As explained in more detail in Sec. V, using an exponentially decaying average is also useful
from a mathematical perspective (convergence and existence of stationary policies are guaranteed).
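
The role of the normalizing factor in the discounted average can be checked numerically. The sketch below is an illustration only: it truncates the infinite sum to a finite horizon, and the function name and the values of the discount factor are assumptions chosen for the example.

```python
import numpy as np

def discounted_average(x, gamma):
    """Truncated version of (1 - gamma) * sum_n gamma**n * x[n]."""
    x = np.asarray(x, dtype=float)
    weights = (1.0 - gamma) * gamma ** np.arange(len(x))
    return float(weights @ x)

# A constant sequence is (almost) reproduced once the horizon is long enough,
# because the weights sum to 1 - gamma**N.
avg_constant = discounted_average(np.full(2000, 2.0), gamma=0.95)

# Early slots weigh more: the same spike counts less when it occurs later.
early = discounted_average([1.0, 0.0, 0.0], gamma=0.5)   # weight (1 - 0.5) * 0.5**0
late = discounted_average([0.0, 0.0, 1.0], gamma=0.5)    # weight (1 - 0.5) * 0.5**2
```

This is why the long-term constraints behave like averages over time while still giving more emphasis to the near future.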

While the previous constraint guarantees QoS for the SUs, we also need to guarantee a level 
of QoS for the PUs. As explained in the introduction, there are different strategies to limit the 
interference that SUs cause to PUs; e.g., by imposing limits on the interfering power at the PUs, 
or on the rate loss that such interference generates [17]. In this paper, we will guarantee that the
long-term probability of a PU being interfered by SUs is below a certain prespecified threshold
$\check{o}_k$. Mathematically, we require $\Pr\{\sum_m w_k^m = 1 \mid a_k = 1\} \le \check{o}_k$ for each band $k = 1, \ldots, K$. Using
Bayes' theorem, and capitalizing on the fact that both $a_k$ and $\sum_m w_k^m$ are boolean variables, the
constraint can be rewritten as:

$$ \mathbb{E}\left[ \lim_{N \to \infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n\, a_k[n] \sum_m w_k^m[n] \right] \Big/ A_k \le \check{o}_k, \quad \forall k, \tag{6} $$

where $A_k$, which is assumed known, denotes the stationary probability of the kth band being
occupied by the corresponding primary user. Writing the constraint in this form reveals its underlying
convexity. Before moving to the next step, two clarifications are in order. The first one is on the
practicality of (6). Constraints that allow for a certain level of interference are reasonable because
error-free sensing is unrealistic. Indeed, our model assumes that even if channel k is sensed as idle,
there is a probability $P_k^{MD}$ of being occupied. Moreover, when the interference limit is formulated
as a long-term constraint (as it is in our case), there is an additional motivation for the constraint.
The system is able to exploit the so-called interference diversity [26]. Such diversity allows SUs
to take advantage of very good channel realizations even if they are likely to interfere PUs. To
balance the outcome, SUs will be conservative when channel realizations are not that good and
may remain silent even if it is likely that the PU is not present. The second clarification is that
we implicitly assumed that SU transmissions are possible even if the PU is present. The reason
is twofold. First, the fact that a SU transmitter is interfering a PU receiver does not necessarily
imply that the reciprocal is true. Second, since the NC does not have any control over the power
that primary transmitters use, the interfering power at the secondary receiver is a state variable. As
such, it could be incorporated into $h_k^m[n]$ as an additional source of noise.

The fourth (and last) step to formulate the optimization problem is to design the metric (objective)
to be maximized. Different utility (reward) and cost functions can be used for such a purpose. As
mentioned in the introduction, in this work we are interested in schemes that maximize the weighted
sum rate of the SUs and minimize the cost associated with sensing. Specifically, we consider that
every time that channel k is sensed, the system has to pay a price $\xi_k > 0$. We assume that
such a price is fixed and known beforehand, but time-varying prices can be accommodated into
our formulation too (see Sec. VII-B for additional details). This way, the sensing cost at time
n is $U_S[n] := \sum_k \xi_k s_k[n]$. Similarly, we define the utility for the SUs at time n as
$U_{SU}[n] := \sum_k \sum_m \beta_m w_k^m[n]\, C_k^m(h_k^m[n], p_k^m[n])$, where $\beta_m > 0$ is a user-priority coefficient. Based on these
definitions, the utility for our CR at time n is $U_T[n] := U_{SU}[n] - U_S[n]$. Finally, we aim to maximize
the long-term utility of the system denoted by $\bar{U}_T$ and defined as

$$ \bar{U}_T := \mathbb{E}\left[ \lim_{N \to \infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n\, U_T[n] \right]. \tag{7} $$

With these notational conventions, the optimal $s_k^*[n]$, $w_k^{m*}[n]$ and $p_k^{m*}[n]$ will be obtained as the
solution of the following constrained optimization problem.

$$ \max_{\{s_k[n],\, w_k^m[n],\, p_k^m[n]\}} \; \bar{U}_T \tag{8a} $$

$$ \text{s. to: } (4), \; w_k^m[n] \in \{0, 1\}, \; p_k^m[n] \ge 0 \tag{8b} $$

$$ (5), \; (6) \tag{8c} $$

$$ s_k[n] \in \{0, 1\}. \tag{8d} $$

Note that constraints in (8b) and (8c) affect the design variables involved in the RA task ($w_k^m[n]$
and $p_k^m[n]$), while (8d) affects the design variables involved in the sensing task ($s_k[n]$). Moreover,
the reason for writing (8b) and (8c) separately is that (8b) refers to constraints that need to hold for
each and every time instant n, while (8c) refers to constraints that need to hold in the long-term.
The main difficulty in solving (8) is that the solution for all time instants has to be found jointly.
The reason is that sensing decisions at instant n have an impact not only at that instant, but at future
instants too. As a result, a separate per-slot optimization approach is not optimal, and DP techniques
have to be used instead. Since DP problems generally have exponential complexity, we will use
a two-step strategy to solve (8), which will considerably reduce the computational burden without
sacrificing optimality. To explain such a strategy, it is convenient to further clarify the operation of
the system. In Sec. II we explained that at each slot n, our CR had to implement three main tasks:
T1) acquisition of the SISN, T2) sensing and update of the SIPN, and T3) allocation of resources.
In what follows, task T2 is split into 3 subtasks, so that the CR runs five sequential steps:

• T1) At the beginning of the slot, the system acquires the exact value of the channel gains $h_k^m[n]$;

• T2.1) the Markov transition matrix $\mathbf{P}_k$ and the post-decision beliefs $\mathbf{b}_k^S[n-1]$ of the previous
instant are used to obtain the pre-decision beliefs $\mathbf{b}_k[n]$ via (1);

• T2.2) $h_k^m[n]$ and $\mathbf{b}_k[n]$ are used to find $s_k^*[n]$;

• T2.3) $s_k^*[n]$ and $z_k[n]$ (for the channels for which $s_k^*[n] = 1$) are used to get the post-decision
beliefs $\mathbf{b}_k^S[n]$ via (2) and (3);

• T3) $h_k^m[n]$ and $\mathbf{b}_k^S[n]$ are used to find the optimal value of $w_k^m[n]$ and $p_k^m[n]$, and the SUs
transmit accordingly.

The two-step strategy to solve (8) will proceed as follows. In the first step, we will find the
optimal $w_k^m[n]$ and $p_k^m[n]$ for any sensing scheme. Such a problem is simpler than the original one
in (8) not only because the dimensionality of the optimization space is smaller, but also because
we can ignore (drop) all the terms in (8) that depend only on $s_k[n]$. This will be critical, because
if the sensing is not optimized, a per-slot optimization with respect to (w.r.t.) the remaining design
variables is feasible. In the second step, we will substitute the output of the first step into (8) and
solve for the optimal $s_k[n]$. Clearly, the output of the first step will be used in T3 while the output
of the second step will be used in T2.2. The optimization in the first step (RA) is addressed next,
while the optimization in the second step (sensing) is addressed in Sec. V.

IV. Optimal RA for the secondary network 

According to what we just explained, the objective of this section is to design the optimal RA 
(scheduling and powers) for a fixed sensing policy. It is worth stressing that solving this problem
is convenient because: i) it corresponds to one of the tasks our CR has to implement; ii) it is a
much simpler problem than the original problem in (8); indeed, the problem in this section has a
smaller dimensionality and, more importantly, can be recast as a convex optimization problem; and
iii) it will serve as an input for the design of the optimal sensing, simplifying the task of finding
the global solution of (8).

Because in this section the sensing policy is considered given (fixed), $s_k[n]$ is not a design variable,
and all the terms that depend only on $s_k[n]$ can be ignored. Specifically, the sensing cost $U_S[n]$ in
(8a) and the constraint in (8d) can be dropped. The former implies that the new objective to optimize
is $\bar{U}_{SU} := \sum_{k,m} \mathbb{E}\left[ \lim_{N \to \infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \beta_m w_k^m[n]\, C_k^m(h_k^m[n], p_k^m[n]) \right]$. With these considerations
in mind, we aim to solve the following problem [cf. (8)]

$$ \max_{\{w_k^m[n],\, p_k^m[n]\}} \; \bar{U}_{SU} \tag{9a} $$

$$ \text{s. to: } (8b), \; (8c). \tag{9b} $$

A slightly modified version of this problem was recently posed and solved in [17]. For this reason
we organize the remainder of this section into two parts. The first one summarizes (and adapts) the
results in [17], presenting the optimal RA. The second part is devoted to introducing new variables
that will serve as input for the design of the optimal sensing in Sec. V.

A. Solving for the RA 

It can be shown that after introducing some auxiliary (dummy) variables and relaxing the constraint 
tO/^fn] G {0, 1} to G [0, 1], the resultant problem in © is convex. Moreover, with probability 

one the solution to the relaxed problem is the same than that of the original problem; see ifTTll as 
well as |fl"6l for details on how to obtain the solution for this problem. The approach to solve © 
is to dualize the long-term constraints in (|8cl) . For such a purpose, let n m and Ok be the Lagrange 
multipliers associated with constraints © and ©, respectively. It can be shown then that the optimal 
solution to © is 

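\[ p_k^{m*}[n] := \Big[ (\dot{C}_k^m)^{-1}\big(h_k^m[n], \pi^m/\beta^m\big) \Big]_+ \tag{10} \]
\[ w_k^{m*}[n] := \mathbb{1}_{\{(L_k^m[n] = \max_q L_k^q[n]) \,\wedge\, (L_k^m[n] > 0)\}}, \quad \text{with} \tag{11} \]
\[ L_k^m[n] := L_{SU,k}^m[n] - \theta_k B_k^S[n], \quad \text{and} \tag{12} \]
\[ L_{SU,k}^m[n] := \beta^m C_k^m\big(h_k^m[n], p_k^{m*}[n]\big) - \pi^m p_k^{m*}[n]. \tag{13} \]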
Two auxiliary variables $L_{SU,k}^m[n]$ and $L_k^m[n]$ have been defined. Such variables are useful to express
the optimal RA but also to gain insights on how the optimal RA operates. Both variables can be
viewed as instantaneous reward indicators (IRIs), which represent the reward that can be obtained
if $w_k^m[n]$ is set to one. The indicator $L_{SU,k}^m[n]$ considers information only of the secondary network
and represents the best achievable trade-off between the rate and power transmitted by the SU. The
risk of interfering with the PU is considered in $L_k^m[n]$, which is obtained by adding an
interference-related term to $L_{SU,k}^m[n]$. Clearly, the (positive) multipliers $\pi^m$ and $\theta_k$ can be viewed as power and
interference prices, respectively. Note that (11) dictates that only the user with the highest IRI can access
the channel. Moreover, it also establishes that if all users obtain a negative IRI, then none of them
should access the channel (in other words, an idle SU with zero IRI would be the winner during
that time slot). This is likely to happen if, for example, the probability of the kth PU being present
is close to one, so that the value of $\theta_k B_k^S[n]$ in (12) is high, rendering $L_k^m[n]$ negative for all m.
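The per-channel allocation rule in (10)-(13) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the rate function is $C(h, p) = \log(1 + hp)$ (the paper keeps $C_k^m$ generic), in which case inverting the derivative in (10) gives the familiar water-filling power. The function names and the argument layout are hypothetical.

```python
import math


def optimal_power(h, beta, pi):
    """Power from (10), assuming C(h, p) = log(1 + h*p): solves beta * dC/dp = pi, clipped at zero."""
    return max(beta / pi - 1.0 / h, 0.0)


def su_iri(h, beta, pi):
    """Secondary-network IRI, cf. (13): weighted rate minus priced power at the optimal power."""
    p = optimal_power(h, beta, pi)
    return beta * math.log(1.0 + h * p) - pi * p


def allocate_channel(gains, betas, pis, theta_k, belief_k):
    """Return (winning user index or None, its power) for one channel, cf. (11)-(12).

    gains, betas, pis hold one entry per secondary user; belief_k is the
    post-decision probability that the primary user is present.
    """
    iris = [su_iri(h, b, pi) - theta_k * belief_k
            for h, b, pi in zip(gains, betas, pis)]
    best = max(range(len(iris)), key=lambda m: iris[m])
    if iris[best] <= 0.0:
        return None, 0.0  # every IRI is non-positive: the channel stays idle
    return best, optimal_power(gains[best], betas[best], pis[best])
```

With a low belief of primary activity the user with the best channel wins; raising the belief (or the interference price) eventually leaves the channel unassigned, which matches the discussion above.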

The expressions in (10)-(13) also reveal the favorable structure of the optimal RA. The only
parameters linking users, channels and instants are the multipliers $\pi^m$ and $\theta_k$. Once they are known,
the optimal RA can be found separately. Specifically: i) the power for a given user-channel pair,
which is the one that maximizes the corresponding IRI (setting the derivative of (13) to zero yields
(10)), is found separately from the power for other users and channels; and ii) the optimal scheduling
for a given channel, which is the one that maximizes the IRI within the corresponding channel, is
found separately from that in other channels. Since, once the multipliers are known, the IRIs depend
only on information at time n, the two previous properties imply that the optimal RA can be found
separately for each time instant n. Additional insights on the optimal RA schemes will be given in
Sec. VII-A.

Several methods to set the value of the dual variables $\pi^m$ and $\theta_k$ are available. Since, after
relaxation, the problem has zero duality gap, there exists a constant (stationary) optimal value for
each multiplier, denoted as $\pi^{m*}$ and $\theta_k^*$, such that substituting $\pi^m = \pi^{m*}$ and $\theta_k = \theta_k^*$ into (10)
and (12) yields the optimal solution to the RA problem. Optimal Lagrange multipliers are rarely
available in closed form and they have to be found through numerical search, for example by using
a dual subgradient method aimed at maximizing the dual function associated with (9) [2]. A different
approach is to rely on stochastic approximation tools. Under this approach, the dual variables are
rendered time variant, i.e., $\pi^m = \pi^m[n]$ and $\theta_k = \theta_k[n]$. The objective now is not necessarily
to find the exact value of $\pi^{m*}$ and $\theta_k^*$, but online estimates of them that remain inside a neighborhood
of the optimal value. See Sec. VII-C and [4], [17] for further discussion on this issue.
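As a rough illustration of the stochastic approach, the sketch below shows one projected stochastic subgradient step for a single price. It assumes a long-term constraint of the generic form "average usage no larger than a budget"; the exact constraints and the update actually used by the authors are given elsewhere in the paper, so the function name, its arguments and the constant step size are hypothetical.

```python
def update_multiplier(price, used, budget, step):
    """One online update of a dual variable (price).

    The price grows when the instantaneous usage exceeds the budget and
    shrinks otherwise; the projection keeps it non-negative.
    """
    return max(price + step * (used - budget), 0.0)
```

With a small constant step the price does not converge exactly, but it hovers around the optimal value, which is the behavior described above.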

B. RA as input for the design of the optimal sensing 

The optimal solution in (10)-(13) will serve as input for the algorithms that design the optimal
sensing scheme. For this reason, we introduce some auxiliary notation that will simplify the
mathematical derivations in the next section. On top of being useful for the design of the optimal sensing,
the results in this section will help us to gain insights and intuition on the properties of the optimal
RA. Specifically, let $L[n]$ be an auxiliary variable referred to as global IRI, which is defined as

\[ L[n] := \sum_k L_k[n], \quad \text{with} \quad L_k[n] := \sum_m w_k^{m*}[n] L_k^m[n]. \tag{14} \]

Due to the structure of the optimal RA, the IRI for channel k can be rewritten as [cf. (11), (12)]:

\[ L_k[n] := \Big[ \max_q L_k^q[n] \Big]_+. \tag{15} \]

Mathematically, $L[n]$ represents the contribution to the Lagrangian of (9) at instant n when $p_k^m[n] = p_k^{m*}[n]$
and $w_k^m[n] = w_k^{m*}[n]$ for all k and m. Intuitively, one can view $L[n]$ as the instantaneous
functional that the optimal RA maximizes at instant n.

Key for the design of the optimal sensing is to understand the effect of the belief on the
performance of the secondary network and, thus, on $L[n]$. For such a purpose, we first define the IRI
for the SUs in channel k as $L_{SU,k}[n] := \max_q L_{SU,k}^q[n]$. Then, we use $L_{SU,k}[n]$ to define the nominal
IRI vector $\mathbf{l}_k[n]$ as

\[ \mathbf{l}_k[n] := \begin{pmatrix} L_{SU,k}[n] \\ L_{SU,k}[n] - \theta_k \end{pmatrix}. \tag{16} \]

Such a vector can be used to write $L_k[n]$ as a function of the belief $\mathbf{b}_k^S[n]$. Specifically,

\[ L_k[n] = \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k^S[n] \big]_+. \tag{17} \]

This suggests that the optimization of the sensing (which affects the value of $\mathbf{b}_k^S[n]$) can be performed
separately for each of the channels. Moreover, (17) also reveals that $L_k[n]$ can be viewed as the
expected IRI: the second entry of $\mathbf{l}_k[n]$ is the IRI if the PU is present, the first entry of $\mathbf{l}_k[n]$ is
the IRI if it is not, and the entries of $\mathbf{b}_k^S[n]$ account for the corresponding probabilities, so that the
expectation is carried over the SIPN uncertainties. Equally important, while the value of $\mathbf{b}_k^S[n]$ is
only available after making the sensing decision, the value of $\mathbf{l}_k[n]$ is available before making such
a decision. In other words, sensing decisions do not have an impact on $\mathbf{l}_k[n]$, but only on $\mathbf{b}_k^S[n]$.
These properties will be exploited in the next section.
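The expected-IRI reading of (17) can be checked with a few lines of code. The sketch below assumes the belief vector is ordered as (probability PU absent, probability PU present), in line with the description of the entries of the nominal IRI vector; the function name is hypothetical.

```python
def channel_iri(l_su, theta_k, prob_pu_present):
    """Expected channel IRI, cf. (16)-(17): non-negative part of the nominal IRI averaged over the belief."""
    nominal = (l_su, l_su - theta_k)                      # IRI if PU absent, IRI if PU present
    belief = (1.0 - prob_pu_present, prob_pu_present)     # post-decision belief
    expected = nominal[0] * belief[0] + nominal[1] * belief[1]
    return max(expected, 0.0)
```

Expanding the inner product gives the secondary-network IRI minus the interference price times the probability of PU presence, which is the form in (12) evaluated for the best user.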
V. Optimal sensing

The aim of this section is to leverage the results of Secs. III and IV to design the optimal sensing
scheme. Recall that current sensing decisions have an impact not only on the current reward (cost)
of the system, but also on future rewards. This in turn implies that future sensing decisions are
affected by the current decision, so that the sensing decisions across time form a string of events
that has to be optimized jointly. Consequently, the optimization problem has to be posed as a DP.
The section is organized as follows. First, we present a brief summary of the relevant concepts
related to DP and POMDP which will be important to address the design of the optimal sensing for
the system setup considered in this paper (Sec. V-A). Readers familiar with DP and POMDP can
skip that section. Then, we substitute the optimal RA policy obtained in Sec. IV into the original
optimization problem presented in Sec. III and show that the design of the optimal sensing amounts
to solving a set of separate unconstrained DP problems (Sec. V-B). Lastly, we obtain the solution
to each of the DP problems formulated (Sec. V-C). It turns out that the optimal sensing leverages:
$\xi_k$, the sensing cost at time n; the expected channel IRI at time n, which basically depends on
$\mathbf{l}_k[n]$ (SISN) and the pre-decision belief (SIPN); and the future reward for time slots n' > n. The
future reward is quantified by the value function associated with each channel's DP, which plays
a fundamental role in the design of our sensing policies. Intuitively, a channel is sensed if there
is uncertainty on the actual channel occupancy (SIPN) and the potential reward for the secondary
network is high enough (SISN). The expression for the optimal sensing provided at the end of this
section will corroborate this intuition.

A. Basic concepts about DP 

DP is a set of techniques and strategies used to optimize the operation of discrete-time complex
systems, where decisions have to be made sequentially and there is a dependency among decisions
in different time instants. These systems are modeled as state-space models composed of: a set
of state variables $u[n] \in \mathcal{U}$; a set of actions which are available to the controller and which can
depend on the state, $a[n] \in \mathcal{A}(u[n])$; a transition function that describes the dynamics of the system
as a function of the current state and the action taken, $u[n+1] = U'(u[n], a[n], \omega[n+1])$, where
$\omega[n+1]$ is a random (innovation) variable; and a function that defines the reward associated with
a state transition or a state-action pair, $R(u[n], a[n])$. In general, finding the optimal solution of a
DP is computationally demanding. Unless the structure of the specific problem can be exploited,
complexity grows exponentially with the size of the state space, the size of the action space, and the
length of the temporal horizon. This is commonly referred to as the triple curse of dimensionality
[21]. Two classical strategies to mitigate such a problem are: i) framing the problem into a specific,
previously studied model and ii) finding approximate solutions that reduce the computational
cost in exchange for a small loss of optimality.

DP problems can be classified into finite-horizon and infinite-horizon problems. For the latter 
class, which is the one corresponding to the problem in this paper, it is assumed that the system is 
going to be operated during a very large time lapse, so that actions at any time instant are chosen 
to maximize the expected long-term reward, i.e., 



\[ \max \; \mathbb{E}\left[ \sum_{t=n}^{\infty} \gamma^{t} R(u[t], a[t]) \right]. \tag{18} \]



The role of the discount factor $\gamma \in (0, 1)$ is twofold: i) it encourages solutions which are focused
on early rewards; and ii) it contributes to stabilizing the numerical calculation of the optimal policies.
In particular, the presence of $\gamma$ guarantees the existence of a stationary policy, i.e., a policy where
the action at a given instant is a function of the system state and not of the time instant. Note that
multiplying (18) by the factor $(1 - \gamma)$, so that the objective resembles the one used throughout the
paper, does not change the optimal policy.
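As a quick check of the last remark (a worked step that is not part of the original derivation), the factor $(1-\gamma)$ only normalizes the discount weights so that they add up to one:

```latex
(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t} = (1-\gamma)\,\frac{1}{1-\gamma} = 1,
\qquad
\arg\max_{\text{policy}} \; (1-\gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(u[t],a[t])\Big]
= \arg\max_{\text{policy}} \; \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(u[t],a[t])\Big].
```

Hence the normalized objective can be read as a weighted average of the per-slot rewards, and scaling by a positive constant leaves the maximizing policy unchanged.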

Key to solving a DP problem is defining the so-called value function that associates a real number
with a state and a time instant. This number represents the expected sum reward that can be obtained,
provided that we operate the system optimally from the current time instant until the operating
horizon. If a minimization formulation is chosen, the value function is also known as cost-to-go
function [3]. The relationship between the optimal action at time n and the value function at time
n, denoted as $V_n(\cdot)$, is given by Bellman's equations 0, EQ:

\[ V_n(u[n]) = \max_a \big\{ R(u[n], a) + \gamma\, \mathbb{E}_\omega\big[ V_{n+1}(U'(u[n], a, \omega)) \big] \big\} \tag{19a} \]
\[ a^*[n] = a^*(u[n]) = \arg\max_a \big\{ R(u[n], a) + \gamma\, \mathbb{E}_\omega\big[ V_{n+1}(U'(u[n], a, \omega)) \big] \big\} \tag{19b} \]

where $\omega$ is the information that arrives at time n + 1, and thus we have to take the expectation over
$\omega$. The value function for different time instants can be recursively computed by using backwards
induction. Moreover, for infinite-horizon formulations with $\gamma < 1$, it holds that the value function
is stationary. As a result, the dependence of $V_n(\cdot)$ on n can be dropped and (19) can be rewritten
using the stationary value function $V(\cdot)$. In this scenario, alternative techniques that exploit the fact
of the value function being stationary (such as "value iteration" and "policy iteration" [21, Ch. 2])
can be used to compute $V(\cdot)$.
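For readers less familiar with these techniques, value iteration on a small, fully observable MDP with finite state and action sets can be sketched as follows. This is a textbook illustration rather than the algorithm used in the paper; the data layout (`rewards[s][a]`, `transitions[s][a][s2]`) and the function name are hypothetical.

```python
def value_iteration(rewards, transitions, gamma, tol=1e-9, max_iter=100_000):
    """Stationary value function and greedy policy of a finite discounted MDP.

    rewards[s][a] is the reward of action a in state s, and
    transitions[s][a][s2] is the probability of moving from s to s2 under a.
    """
    n_states = len(rewards)

    def q_value(s, a, values):
        expected_next = sum(p * values[s2] for s2, p in enumerate(transitions[s][a]))
        return rewards[s][a] + gamma * expected_next

    values = [0.0] * n_states
    for _ in range(max_iter):
        # Apply the stationary Bellman operator once to every state.
        new_values = [max(q_value(s, a, values) for a in range(len(rewards[s])))
                      for s in range(n_states)]
        gap = max(abs(x - y) for x, y in zip(new_values, values))
        values = new_values
        if gap < tol:
            break
    policy = [max(range(len(rewards[s])), key=lambda a, s=s: q_value(s, a, values))
              for s in range(n_states)]
    return values, policy
```

Because the Bellman operator is a contraction for a discount factor below one, the iterates approach the stationary value function geometrically, which is why a simple stopping rule on the largest update suffices.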

1) Partially Observable Markov Decision Processes: Markov Decision Processes (MDPs) are an 
important class within DP problems. For such problems, the state transition probabilities depend 
only on the current state-action pair, the average reward in each step only depends on the state-action 
pair, and the system state is fully observable. MDPs can have finite or infinite state-action spaces. 
MDPs with finite state-action spaces can be solved exactly for finite-horizon problems. For infinite 
horizon problems, the solution can be approximated with arbitrary precision. A partially observable
MDP (i.e., a POMDP) can be viewed as a generalization of an MDP for which the state is not always
known perfectly. Only an observation of the state (which may be affected by errors, missing data 
or ambiguity) is available. To deal with these problems, it is assumed that an observation function, 
which assigns a probability to each observation depending on the current state and action, is known. 
When dealing with POMDPs, there is no distinction between actions taken to change the state of the 
system under operation and actions taken to gain information. This is important because, in general, 
every action has both types of effect. 

The POMDP framework provides a systematic method of using the history of the system (actions 
and observations) to aid in the disambiguation of the current observation. The key point is the 
definition of an internal belief state accounting for previous actions and observations. The belief 
state is useful to infer the most probable state of the system. Formally, the belief state is a probability 
distribution over the states of the system. Furthermore, for POMDPs this probability distribution 
comprises a sufficient statistic for the past history of the system. In other words, the process over 
belief states is Markov, and no additional data about the past would help to increase the agent's 
expected reward [1]. The optimal policy of a POMDP agent must map the current belief state into
an action. This implies that a discrete state-space POMDP can be re-formulated (and viewed) as
a continuous-space MDP. This equivalent MDP is defined such that the state space is the set of
possible belief states of the POMDP, i.e., the probability simplex of the original state space. The set of
actions remains the same; and the state-transition function and the reward functions are redefined 
over the belief states. More details about how these functions are redefined in general cases can be 
found at |[TTT|. Clearly, our problem falls into this class. The actual SIPN is Markovian, while the 
errors in the CSI render the SIPN partially observable. These specific functions corresponding to 
our problem are presented in the following sections. 
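The way a belief state summarizes the history can be sketched with a generic discrete Bayes filter. The code below is an illustration under stated assumptions, not the paper's notation: the transition matrix is taken to be row-stochastic (`transition[i][j]` is the probability of moving from state i to state j), and `likelihood[i]` is the probability of the received observation given state i. Both function names are hypothetical.

```python
def predict(belief, transition):
    """Prediction step: belief about the next state before any new observation."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def correct(belief, likelihood):
    """Correction step (Bayes' rule): returns the posterior and the observation probability."""
    joint = [lik * b for lik, b in zip(likelihood, belief)]
    evidence = sum(joint)
    return [x / evidence for x in joint], evidence
```

For a two-state channel (idle, busy) observed through a detector with false-alarm and miss-detection probabilities, the likelihood vector of a "busy" report would be (false-alarm probability, one minus miss-detection probability), so a single report already moves a uniform belief strongly toward the reported state.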
B. Formulating the optimal sensing problem

The aim of this section is to formulate the optimal decision problem as a standard (unconstrained)
DP. The main task is to substitute the optimal RA into the original optimization problem in (8).
Recall that the optimization in (8) involved the variables $s_k[n]$, $w_k^m[n]$ and $p_k^m[n]$, and the sets of
constraints in (8b), (8c) and (8d), the latter requiring $s_k[n] \in \{0, 1\}$. When the optimal solution for
$w_k^{m*}[n]$, $p_k^{m*}[n]$ presented in Sec. IV is substituted into (8), the resulting optimization problem is

\[ \max_{\{s_k[n]\}} \; U_{T|RA^*} \tag{20a} \]
\[ \text{s. to: } s_k[n] \in \{0, 1\}, \tag{20b} \]

where $U_{T|RA^*}$ stands for the total utility given the optimal RA and is defined as



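\[ U_{T|RA^*} := \mathbb{E}\left[ \lim_{N\to\infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \sum_k \Big( -\xi_k s_k[n] + \sum_m w_k^{m*}[n] L_k^m[n] \Big) \right] \tag{21} \]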



which, using the definitions introduced in Sec. IV-B, can be rewritten as [cf. (13) and (17)]



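\[ U_{T|RA^*} = \mathbb{E}\left[ \lim_{N\to\infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \sum_k \big( -\xi_k s_k[n] + L_k[n] \big) \right] \tag{22} \]
\[ \phantom{U_{T|RA^*}} = \mathbb{E}\left[ \lim_{N\to\infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \sum_k \Big( -\xi_k s_k[n] + \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k^S[n] \big]_+ \Big) \right]. \tag{23} \]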



The three main differences between (20) and the original formulation in (8) are that now: i) the
only optimization variables are $s_k[n]$; ii) because the optimal RA fulfills the constraints in (8b) and
(8c), the only constraints that need to be enforced are (8d), which simply require $s_k[n] \in \{0, 1\}$ [cf.
(20b)]; and iii) as a result of the Lagrangian relaxation of the DP, the objective has been augmented
with the terms accounting for the dualized constraints.

Key to finding the solution of (20) will be the facts that: i) $a_k[n]$ is independent of $h_k^m[n]$, and ii)
$a_k[n]$ is independent of $a_{k'}[n]$ for $k \neq k'$. The former implies that the state transition functions for
$a_k[n]$ do not depend on $h_k^m[n]$, while the latter allows us to solve for each of the channels separately.
Therefore, we will be able to obtain the optimal sensing policy by solving separate DPs (POMDPs),
which will rely only on state information of the corresponding channel. Specifically, the optimal
sensing can be found as the solution of the following DP:



\[ \max_{\{s_k[n] \in \{0,1\}\}} \; \sum_k \mathbb{E}\left[ \lim_{N\to\infty} \sum_{n=0}^{N-1} (1-\gamma)\gamma^n \Big( -\xi_k s_k[n] + \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k^S[n] \big]_+ \Big) \right] \tag{24} \]

which can be separated channel-wise. Clearly, the reward function for the kth DP is

\[ R_k[n] = -\xi_k s_k[n] + \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k^S[n] \big]_+. \tag{25} \]

The structure of (25) manifests clearly that this is a joint design because $s_k[n]$ affects the two terms
in (25). The first term (which accounts for the cost of the sensing scheme) is just the product of the
constant $\xi_k$ and the sensing variable $s_k[n]$. The second term (which accounts for the reward of the
RA) is the nonnegative part of the dot product of vectors $\mathbf{l}_k[n]$ (which does not depend on $s_k[n]$)
and $\mathbf{b}_k^S[n]$ (which does depend on $s_k[n]$). The expression in (25) also reveals that $\mathbf{l}_k[n]$
encapsulates all the information pertaining to the SUs which is relevant to find $s_k^*[n]$. In other words,
in lieu of knowing $h_k^m[n]$, $w_k^{m*}[n]$ and $p_k^{m*}[n]$, it suffices to know $\mathbf{l}_k[n]$.

Relying on (24) and (25), and taking into account that the problem can be separated across
channels, at each time slot n the optimal sensing for channel k can be obtained as [cf. (18)]

\[ s_k^*[n] = \arg\max_{s \in \{0,1\}} \left\{ \lim_{N\to\infty} \sum_{t=n}^{N-1} (1-\gamma)\gamma^{t}\, \mathbb{E}\big[ R_k[t] \,\big|\, s_k[n] = s \big] \right\}. \tag{26} \]



C. Bellman's equations and optimal solution

To find $s_k^*[n]$, we will derive the Bellman's equations associated with (26). For such a purpose,
we split the objective in (26) into the present and future rewards and drop the constant factor
$(1-\gamma)\gamma^n$. Then, (26) can be rewritten as

\[ s_k^*[n] = \arg\max_{s \in \{0,1\}} \left\{ \mathbb{E}\big[ R_k[n] \,\big|\, s_k[n] = s \big] + \gamma \lim_{N\to\infty} \sum_{t=n+1}^{N-1} \gamma^{t-n-1}\, \mathbb{E}\big[ R_k[t] \,\big|\, s_k[n] = s \big] \right\}. \tag{27} \]



It is clear that the expected reward at time slot t = n depends on $s_k[n]$; recall that both terms in
(25) depend on $s_k[n]$. Moreover, the expected reward at time slots t > n also depends on the current
$s_k[n]$. The reason is that $\mathbf{b}_k[t]$ for t > n depends on $s_k[n]$ [cf. Sec. III-A]. This is testament to the
fact that our problem is indeed a POMDP: current actions that improve the information about the
current state also have an impact on the information about the state in future instants.

To account for that effect in the formulation, we need to introduce the value function $V_k(\cdot)$
that quantifies the expected sum reward on channel k for all future instants. Recall that, because
(26) is an infinite-horizon problem with $\gamma < 1$, the value function is stationary
and its existence is guaranteed [cf. Sec. V-A]. Stationarity implies that the expression for $V_k(\cdot)$
does not depend on the specific time instant, but only on the state of the system. Since in our
problem the state information is formed by the SISN and the SIPN, $V_k(\cdot)$ should be written as
$V_k(B_k[n], \mathbf{h}_k[n])$. However, since $\mathbf{h}_k[n]$ is i.i.d. across time and independent of $s_k[n]$, the alternative
value function $V_k(B_k[n]) := \mathbb{E}_h[V_k(B_k[n], \mathbf{h}_k[n])]$, where $\mathbb{E}_h$ denotes that the expectation is taken
over all possible values of $\mathbf{h}_k[n]$, can also be considered. The motivation for using $V_k(B_k[n])$ instead
of $V_k(B_k[n], \mathbf{h}_k[n])$ is twofold: it emphasizes the fact that the impact of the sensing decisions on
the future reward is encapsulated into $B_k[n]$, and $V_k(\cdot)$ is a one-dimensional function, so that the
numerical methods to compute it require a lower computational burden.

Based on the previous notation, the standard Bellman's equations that drive the optimal sensing
decision and the value function are [cf. (27)]

\[ s_k^*[n] = \arg\max_{s \in \{0,1\}} \big\{ \mathbb{E}_z[R_k[n] \mid s_k[n] = s] + \gamma\, \mathbb{E}_z[V_k(B_k[n+1]) \mid s_k[n] = s] \big\} \tag{28} \]
\[ V_k(B_k[n]) = \mathbb{E}_h\Big[ \max_{s \in \{0,1\}} \big\{ \mathbb{E}_z[R_k[n] \mid s_k[n] = s] + \gamma\, \mathbb{E}_z[V_k(B_k[n+1]) \mid s_k[n] = s] \big\} \Big], \tag{29} \]

where $\mathbb{E}_z$ denotes taking the expectation over the sensor outcomes. Equation (28) exploits the
fact of the value function being stationary, manifests the dynamic nature of our problem, and
provides further intuition about how sensing decisions have to be designed. The first term in (28),
$\mathbb{E}_z[R_k[n] \mid s_k[n] = s]$, is the expected short-term reward conditioned on $s_k[n] = s$, while the second
term, $\mathbb{E}_z[V_k(B_k[n+1]) \mid s_k[n] = s]$, is the expected long-term sum reward to be obtained in all
future time instants, conditioned on $s_k[n] = s$ and on every future decision being optimal. Equation
(29) expresses the condition that the value function $V_k(\cdot)$ must satisfy in order to be optimal (and
stationary) and provides a way to compute it iteratively.

Since obtaining the optimal sensing decision $s_k^*[n]$ at time slot n (and also evaluating the
stationarity condition for the value function) boils down to evaluating the objective in (28) for $s_k[n] = 0$
and $s_k[n] = 1$, in the following we obtain the expressions for each of the two terms in (28) for
both $s_k[n] = 0$ and $s_k[n] = 1$. Key for this purpose will be the expressions to update the belief
presented in Sec. III-A. Specifically, those expressions describe how the future beliefs depend
on the current belief, on the set of possible actions (sensing decision), and on the random variables
associated with those actions (outcome of the sensing process if the channel is indeed sensed).

The expressions for the expected short-term reward [cf. first summand in (28)] are the following.
If $s_k[n] = 0$, the channel is not sensed, there is no correction step, and the post-decision belief
coincides with the pre-decision belief [cf. Sec. III-A]. The expected short-term reward in this case is:



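\[ \mathbb{E}_z[R_k[n] \mid s_k[n] = 0] = \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k[n] \big]_+. \tag{30} \]

On the other hand, if $s_k[n] = 1$, the expected short-term reward is found by averaging over the
probability mass of the sensor outcome $z_k[n]$ and subtracting the cost of sensing

\[ \mathbb{E}_z[R_k[n] \mid s_k[n] = 1] = -\xi_k + \sum_{z \in \{0,1\}} \Pr(z_k[n] = z \mid \mathbf{b}_k[n]) \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k^S(\mathbf{b}_k[n], z) \big]_+ \tag{31} \]

which, by substituting the expression for $\mathbf{b}_k^S(\mathbf{b}_k[n], z)$ from Sec. III-A into (31), yields

\[ \mathbb{E}_z[R_k[n] \mid s_k[n] = 1] = -\xi_k + \sum_{z \in \{0,1\}} \big[ \mathbf{l}_k^T[n]\, \mathbf{D}_z \mathbf{b}_k[n] \big]_+. \tag{32} \]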

Once the expressions for the expected short-term reward are known, we find the expressions for
the expected long-term sum reward [cf. second summand in (28)] for both $s_k[n] = 0$ and $s_k[n] = 1$.
If $s_k[n] = 0$, then there is no correction step [cf. Sec. III-A], and using the prediction step

\[ \mathbb{E}_z[V_k(B_k[n+1]) \mid s_k[n] = 0] = V_k\big( [\mathbf{P}_k \mathbf{b}_k[n]]_2 \big). \tag{33} \]

On the other hand, if $s_k[n] = 1$, the belief for instant n is corrected according to the sensor outcome, and
updated for instant n + 1 using the prediction step as:

\[ \mathbb{E}_z[V_k(B_k[n+1]) \mid s_k[n] = 1] = \sum_{z \in \{0,1\}} \Pr(z \mid \mathbf{b}_k[n])\, V_k\big( [\mathbf{P}_k \mathbf{b}_k^S(\mathbf{b}_k[n], z)]_2 \big). \tag{34} \]

Clearly, the expressions for the expected long-term reward in (33) and (34) account for the expected
value of $V_k$ at time n + 1. Substituting (30), (32), (33) and (34) into (29) yields



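\[ V_k(B_k[n]) = \mathbb{E}_h\left[ \max\left\{ \begin{array}{l} \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k[n] \big]_+ + \gamma V_k\big( [\mathbf{P}_k \mathbf{b}_k[n]]_2 \big), \\[4pt] -\xi_k + \displaystyle\sum_{z \in \{0,1\}} \left( \big[ \mathbf{l}_k^T[n]\, \mathbf{D}_z \mathbf{b}_k[n] \big]_+ + \gamma \Pr(z_k[n] = z \mid \mathbf{b}_k[n])\, V_k\!\left( \frac{[\mathbf{P}_k \mathbf{D}_z \mathbf{b}_k[n]]_2}{\mathbf{1}^T \mathbf{D}_z \mathbf{b}_k[n]} \right) \right) \end{array} \right\} \right], \tag{35} \]

where for the last term we have used the expression for $\mathbf{b}_k^S(\mathbf{b}_k[n], z)$ in Sec. III-A. Equation (35) is useful
not only because it reveals the structure of $V_k(B_k[n])$ but also because it provides a means to compute
the value function numerically (e.g., by using the value iteration algorithm [21, Ch. 3]).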
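A minimal numerical sketch of how (35) could be iterated for one channel is given below. It is not the authors' Matlab implementation and rests on several assumptions that should be checked against Sec. III: the belief is represented by the scalar probability that the PU is present, tabulated on a uniform grid and linearly interpolated; the transition matrix `P` is row-stochastic with state order (idle, busy); the sensor likelihoods are built from the false-alarm and miss-detection probabilities; and the expectation over the SISN is approximated by averaging over a list of sample values of the secondary-network IRI. All function and parameter names are hypothetical.

```python
def interp(values, x):
    """Linearly interpolate a function tabulated on a uniform grid over [0, 1]."""
    last = len(values) - 1
    pos = min(max(x, 0.0), 1.0) * last
    i = min(int(pos), last - 1)
    frac = pos - i
    return (1.0 - frac) * values[i] + frac * values[i + 1]


def q_values(B, L, V, P, p_fa, p_md, xi, theta, gamma):
    """Return (value of not sensing, value of sensing) for belief B and SU IRI L, cf. (35)."""
    def advance(prob_busy):
        # Prediction step: probability that the PU is present at the next instant.
        return (1.0 - prob_busy) * P[0][1] + prob_busy * P[1][1]

    q_skip = max(L - theta * B, 0.0) + gamma * interp(V, advance(B))
    q_sense = -xi
    # Likelihoods (given idle, given busy) of the two sensor outcomes:
    # the sensor reports "idle", then the sensor reports "busy".
    for lik_idle, lik_busy in ((1.0 - p_fa, p_md), (p_fa, 1.0 - p_md)):
        pr_z = lik_idle * (1.0 - B) + lik_busy * B        # probability of this outcome
        if pr_z <= 0.0:
            continue
        B_post = lik_busy * B / pr_z                      # corrected (post-decision) belief
        q_sense += max(L * pr_z - theta * lik_busy * B, 0.0)  # unnormalized expected IRI
        q_sense += gamma * pr_z * interp(V, advance(B_post))
    return q_skip, q_sense


def solve_value_function(l_samples, P, p_fa, p_md, xi, theta, gamma,
                         n_grid=51, tol=1e-6, max_iter=5000):
    """Value iteration over the belief grid; l_samples approximate the average over the SISN."""
    V = [0.0] * n_grid
    for _ in range(max_iter):
        V_new = []
        for i in range(n_grid):
            B = i / (n_grid - 1)
            best = [max(q_values(B, L, V, P, p_fa, p_md, xi, theta, gamma))
                    for L in l_samples]
            V_new.append(sum(best) / len(best))
        gap = max(abs(a - b) for a, b in zip(V, V_new))
        V = V_new
        if gap < tol:
            break
    return V


def sensing_decision(B, L, V, P, p_fa, p_md, xi, theta, gamma):
    """Sense (1) only if the sensing branch gives a strictly larger long-term reward."""
    q_skip, q_sense = q_values(B, L, V, P, p_fa, p_md, xi, theta, gamma)
    return 1 if q_sense > q_skip else 0
```

Once the grid values are available, the sensing rule only needs the current belief and the current nominal IRI, so the expensive part (the iteration) can be run offline. Setting the tabulated value function to zero everywhere in `sensing_decision` gives a decision that ignores future rewards.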

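Similarly, we can substitute the expressions (30)-(34) into (28) and get the optimal solution for
our sensing problem. Specifically, the sensing decision at time n is

\[ \big[ \mathbf{l}_k^T[n]\, \mathbf{b}_k[n] \big]_+ + \gamma V_k\big( [\mathbf{P}_k \mathbf{b}_k[n]]_2 \big) \;\; \underset{s_k^*[n]=1}{\overset{s_k^*[n]=0}{\gtrless}} \;\; -\xi_k + \sum_{z \in \{0,1\}} \left( \big[ \mathbf{l}_k^T[n]\, \mathbf{D}_z \mathbf{b}_k[n] \big]_+ + \gamma \Pr(z_k[n] = z \mid \mathbf{b}_k[n])\, V_k\!\left( \frac{[\mathbf{P}_k \mathbf{D}_z \mathbf{b}_k[n]]_2}{\mathbf{1}^T \mathbf{D}_z \mathbf{b}_k[n]} \right) \right). \tag{36} \]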
The most relevant properties of the optimal sensing policy (several of them have been already pointed
out) are summarized next: i) it can be found separately for each of the channels; ii) since it amounts
to a decision problem, we only have to evaluate the long-term aggregate reward if $s_k[n] = 1$ (the
channel is sensed at time n) and that if $s_k[n] = 0$ (the channel is not sensed at time n), and make the
decision which gives rise to a higher reward; iii) the reward takes into account not only the sensing
cost but also the utility and QoS for the SUs (joint design); iv) the sensing at instant n is found as a
function of both the instantaneous and the future reward (the problem is a DP); v) the instantaneous
reward depends on both the current SISN and the current SIPN, while the future reward depends
on the current SIPN and not on the current SISN; and vi) to quantify the future reward, we need
to rely on the value function $V_k(\cdot)$. The input of this function is the SIPN. Additional insights on
the optimal sensing policy will be given in Sec. VII-A.

VI. Numerical results 

Numerical experiments to corroborate the theoretical findings and gain insights on the optimal
policies are implemented in this section. Since an RA scheme similar to the one presented in this
paper was analyzed in [17], the focus is on analyzing the properties of the optimal sensing scheme.
Interested readers can find additional simulations as well as the Matlab codes used to run them
at http://www.tsc.urjc.es/~amarques/simulations/NumSimulations_lramjrl2.html.

The experiments are grouped into two test cases. In the first one, we compare the performance 
of our algorithms with that of other existing (suboptimal) alternatives. Moreover, we analyze the 
behavior of the sensing schemes and assess the impact of variation of different parameters (correla- 
tion of the PUs activity, sensing cost, sensor quality, and average SNR). In the second test case, we 
provide a graphical representation of the sensing functions in the form of two-dimensional decision 
maps. Such representation will help us to understand the behavior of the optimal schemes. 

The parameters for the default test case are listed in Table I. Four channels are considered, each
of them with different values for the sensor quality, the sensing cost and the QoS requirements.
In most cases, the value of $\check{o}_k$ has been chosen to be larger than the value of $P_k^{MD}$ (so that the
cognitive diversity can be effectively exploited), while the values of the remaining parameters have
been chosen so that the test case yields illustrative results. The secondary links follow a Rayleigh
model and the frequency selectivity is such that the gains are uncorrelated across channels. The
parameters not listed in the table are set to one.



November 6, 2012 



DRAFT 






TABLE I

Parameters of the system under test.

k   SNR    P_k^FA   P_k^MD   P_k                        xi_k   o_k (check)   m   p^m (check)
------------------------------------------------------------------------------------------
1   5 dB   0.09     0.08     [0.95, 0.05; 0.02, 0.98]   1.00   0.30          1   20.0
2   5 dB   0.09     0.08     [0.95, 0.05; 0.02, 0.98]   1.80   0.05          2   16.0
3   5 dB   0.05     0.03     [0.95, 0.05; 0.02, 0.98]   1.00   0.10          3   18.0
4   5 dB   0.05     0.03     [0.95, 0.05; 0.02, 0.98]   1.80   0.10          4   10.0


Test case 1: Optimality and performance analysis. The objective here is twofold. First, we want
to numerically demonstrate that our schemes are indeed optimal. Second, we are also interested in
assessing the loss of optimality incurred by suboptimal schemes with low computational burden.
Specifically, the optimal sensing scheme is compared with the three suboptimal alternatives described
next. A) A myopic policy, which is implemented by setting $V(B) = 0$ $\forall B$. This is equivalent to
the greedy sensing and RA technique proposed in [24], since it only accounts for the reward of
sensing in the current time slot and not in the subsequent time slots. B) A policy which replaces
the infinite-horizon value function with a horizon-1 value function. In other words, a sensing policy
that makes the sensing decision at time n considering the (expected) reward for instants n and
n + 1. C) A rule-of-thumb sensing scheme implementing the simple (separable) decision function:

$s_k[n] = \mathbb{1}_{\{L_k[n] \in [\xi_k,\, \theta_k - \xi_k]\}}\, \mathbb{1}_{\{B_k[n] \in [B_k^S(A_k, 0),\, B_k^S(A_k, 1)]\}}$. In words, the channel is sensed if and only if the
following two conditions are satisfied: a) the channel's IRI is greater than the sensing cost and less
than the interfering cost minus the sensing cost; and b) the uncertainty on the primary occupancy
is higher than that obtained from a unique, isolated measurement.

Results are plotted in Fig. 1. The slight lack of monotonicity observed in the curves is due to the
fact that simulations have been run using a Monte-Carlo approach. As expected, the optimal sensing
scheme achieves the best performance for all test cases. Moreover, Figs. 1(a) and 1(b) reveal that
the horizon-1 value function approximation constitutes a good approximation to the optimal value
function in two cases: i) when the expected transition time is short (low time correlation) and ii)
when the sensing cost is relatively small. The performance of the myopic policy is shown to be far
from optimal. This finding is in disagreement with the results obtained for simpler models in the
opportunistic spectrum access literature [8], where it was suggested that the myopic policy could
be a good approximation to solve the associated POMDP efficiently. The reason may be that the CR
models considered were substantially different (the RA schemes in this paper are more complex
and the interference constraints are formulated differently). In fact, the only cases where the myopic
policy seems to approximate the optimal performance are: i) if $\xi_k \to 0$, which is expected because
then the optimal policy is to sense at every time instant; and ii) if the PUs activity is not correlated
across time (which was the assumption in [24]).



Fig. 1(c) suggests that the benefits of implementing the optimal sensing policies are stronger
when sensors are inaccurate. In other words, the proposed schemes can help to soften the negative
impact of deploying low-quality (cheap) sensing devices. Finally, results in Fig. 1(d) also suggest
that changes in the average SNR between SU and NC have similar effects on the performance of
all analyzed schemes.

Test case 2: Sensing decision maps. To gain insights on the behavior of the optimal sensing
schemes, Fig. 2 plots the sensing decisions as a function of $B_k[n]$ and $L_k[n]$. Simulations are
run using the parameters for the default test case (see Table I) and each subplot corresponds to
a different channel k. Since the domain of the sensing decision function is two dimensional, the
function itself can be efficiently represented as an image (map). Two primary regions are identified,
one corresponding to the pairs $(B_k[n], L_k[n])$ which give rise to $s_k[n] = 1$, and one corresponding to
the pairs giving rise to $s_k[n] = 0$. Moreover, the region where $s_k[n] = 0$ is split into two subregions,
the first one corresponding to $\sum_m w_k^m[n] = 1$ (i.e., when there is a user accessing the channel) and
the second one to $\sum_m w_k^m[n] = 0$ (i.e., when the system decides that no user will access the
channel). Note that for the region where $s_k[n] = 1$, the access decision basically depends on the
outcome of the sensing process $z_k[n]$ (in fact, it can be rigorously shown that $\sum_m w_k^m[n] = 1$ if and
only if $z_k[n] = 1$).

Upon comparing the different subplots, one can easily conclude that the size and shape of the
$s_k[n] = 1$ region depend on $P_k$, $P_k^{FA}$, $P_k^{MD}$, and $b_k$. For example, the simulations reveal that
channels with a stricter interference constraint need to be sensed more frequently, and thus the sensing
region is larger: Fig. 2(a) vs. Fig. 2(b). They also reveal that if the sensing cost increases, then
the sensing region becomes smaller: Fig. 2(c) vs. Fig. 2(d). This was certainly expected because
if sensing is more expensive, then resources have to be saved and used only when the available
information is scarce.

VII. Analyzing the joint schemes and future lines of work 

This section is intended to summarize the main results of the paper, analyze the properties of the 
optimal RA and sensing schemes, and briefly discuss extensions and future lines of work. 

A. Jointly optimal RA and sensing schemes 

The aim of this paper was to design jointly optimal RA and sensing schemes for an overlay
cognitive radio with multiple primary and secondary users. The main challenge was the fact that
sensing decisions at a given instant affect not only the state of the system during the instant they
are made, but also the state at future instants. As a result, our problem falls into the class of
DP problems, which typically require a very high computational complexity to be solved. To address
this challenge efficiently, we formulated the problem as an optimization over an infinite horizon, so that
the objective to be optimized and the QoS constraints to be guaranteed were formulated as long-term
time averages. The reason was twofold: i) short-term constraints are more restrictive than their
long-term counterparts, so that the latter give rise to a better objective, and ii) long-term formulations
are in general easier to handle, because they give rise to stationary solutions. Leveraging the long-term
formulation and using dual methods to solve our constrained sum-utility maximization, we designed
optimal schemes whose input turned out to be: a) the current SISN; b) the current SIPN; c) the
stationary Lagrange multipliers associated with the long-term constraints; and d) the stationary value
(reward-to-go) function associated with the future long-term objective. While a) and b) accounted
for the state of the system at the current time instant $n$, c) and d) accounted for the effect of sensing
and RA in the long term (i.e., for instants other than $n$). In particular, the Lagrange multipliers $\pi_m$
and $\theta_k$ accounted for the long-term cost of satisfying the corresponding constraints. This cost clearly
involves all time instants and cannot be computed based only on the instantaneous RA. Similarly,
the value function $V_k(\cdot)$ quantified the future long-term reward.

Due to a judiciously chosen formulation, our problem could be separated across channels (and
partially across users), giving rise to simple and intuitive expressions for the optimal RA and the
optimal sensing policies. Specifically, for each time instant $n$, their most relevant properties were:

• The optimal RA depends on: the SISN at instant $n$, the SIPN at instant $n$ (post-decision
belief), and the Lagrange multipliers [cf. (10)-(13)]. The effect of the sensing policy on the
RA is encapsulated into the post-decision belief vector (testament to the fact that this is a
joint design). The effect of other time instants is encapsulated into $\pi_m$ (long-term price of
the transmission power) and $\theta_k$ (long-term price for interfering with the $k$th PU). RA decisions
are made so that the instantaneous IRI is maximized. The IRI is a trade-off between a reward
(rate transmitted by the SU) and a cost (a combination of the power consumed by the SU and the
probability of interfering with the PU). The RA is accomplished in a rather intuitive way: each user
selects its power to optimize its own IRI, and then in each of the channels the system picks the
SU that achieves the highest IRI (so that the IRI for that channel is maximized).

• The optimum sensing depends on: the SISN at instant $n$, the SIPN at instant $n$ (pre-decision
belief), the Lagrange multipliers, and the value function [cf. (36)]. The optimum sensing is
a trade-off between the expected instantaneous IRI (which depends on the current SISN and
SIPN), the instantaneous sensing cost, and the future reward (which is given by the value
function $V_k(\cdot)$ and the current SIPN). Both the instantaneous IRI and the value function depend
on the Lagrange multipliers and the RA policies, testament to the fact that this is a joint design.
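
The two-stage allocation rule described in the first bullet (each user optimizes its own IRI, then the channel goes to the user with the highest IRI) can be sketched as follows. This is only an illustration under stated assumptions, not the closed-form expressions of the paper: the IRI form used here (weighted rate minus a power price and an interference price), the grid search over the power, and all names (`iri`, `best_power`, `allocate_channel`, `h`, `beta`, `belief_idle`) are hypothetical.

```python
import math

def iri(p, h, belief_idle, beta, pi_m, theta_k):
    """Assumed IRI: weighted rate minus power price and interference price."""
    if p <= 0.0:
        return 0.0  # a silent user earns nothing and pays nothing
    reward = beta * math.log2(1.0 + h * p)
    cost = pi_m * p + theta_k * (1.0 - belief_idle)
    return reward - cost

def best_power(h, belief_idle, beta, pi_m, theta_k, p_max=4.0, steps=400):
    """Grid search for the power that maximizes this user's own IRI."""
    grid = [p_max * i / steps for i in range(steps + 1)]
    p_star = max(grid, key=lambda p: iri(p, h, belief_idle, beta, pi_m, theta_k))
    return p_star, iri(p_star, h, belief_idle, beta, pi_m, theta_k)

def allocate_channel(users, belief_idle, theta_k):
    """Return (user index, power, IRI) of the winner, or (None, 0.0, 0.0)."""
    best = (None, 0.0, 0.0)
    for m, u in enumerate(users):
        p, value = best_power(u["h"], belief_idle, u["beta"], u["pi"], theta_k)
        if value > best[2]:  # only a strictly positive IRI justifies access
            best = (m, p, value)
    return best

# Hypothetical two-user example on a single channel.
users = [
    {"h": 1.0, "beta": 1.0, "pi": 0.5},
    {"h": 4.0, "beta": 1.0, "pi": 0.5},
]
winner, power, value = allocate_channel(users, belief_idle=0.9, theta_k=2.0)
```

In this toy setting the user with the stronger channel wins the channel, and no user is granted access when the interference price outweighs every achievable reward (for instance, with a low belief that the channel is idle and a large `theta_k`).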

For each time instant, the CR had to run five consecutive steps that were described in detail
earlier in the paper. The expressions for the optimum sensing in (36) had to be used in step T2.2, while
the expressions for the optimal RA in (10)-(13) had to be used in step T3. Once the values of $\pi_m$, $\theta_k$
and $V_k(\cdot)$ were found (during the initialization phase of the system), all five steps entailed very low
computational complexity.
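
The trade-off behind the sensing decision (expected instantaneous IRI, sensing cost, and future reward) can be illustrated with a one-step lookahead for a single channel. The sketch below is an assumption-laden illustration rather than the paper's sensing rule: the belief is taken to be the probability that the channel is idle, $z = 1$ is assumed to mean that the sensor reports an idle channel, and `expected_iri`, `value_fn`, the transition probabilities `p_ii` and `p_bi`, and the discount factor `gamma` are hypothetical placeholders.

```python
def posterior_idle(b, z, p_fa, p_md):
    """Bayes update of P(idle); z = 1 means the sensor reports 'idle'."""
    like_idle = (1.0 - p_fa) if z == 1 else p_fa
    like_busy = p_md if z == 1 else (1.0 - p_md)
    evidence = like_idle * b + like_busy * (1.0 - b)
    if evidence == 0.0:
        return b, 0.0  # outcome z cannot occur under this belief
    return like_idle * b / evidence, evidence

def predict_idle(b, p_ii, p_bi):
    """One-step Markov prediction of P(idle) for the next instant."""
    return b * p_ii + (1.0 - b) * p_bi

def should_sense(b, expected_iri, value_fn, xi, gamma, p_fa, p_md, p_ii, p_bi):
    """Sense iff the expected total reward with sensing exceeds that without."""
    def total(belief):
        return expected_iri(belief) + gamma * value_fn(predict_idle(belief, p_ii, p_bi))

    no_sense = total(b)
    sense = -xi  # the sensing cost is paid up front
    for z in (0, 1):
        post, prob_z = posterior_idle(b, z, p_fa, p_md)
        sense += prob_z * total(post)
    return sense > no_sense

def hinge(b):
    """Toy expected IRI: access pays off only if the channel is likely idle."""
    return max(0.0, 2.0 * b - 1.0)

def myopic(b):
    """Placeholder value function that ignores the future."""
    return 0.0

params = dict(gamma=0.9, p_fa=0.1, p_md=0.1, p_ii=0.9, p_bi=0.2)
uncertain = should_sense(0.5, hinge, myopic, xi=0.1, **params)
confident = should_sense(0.99, hinge, myopic, xi=0.1, **params)
```

With these toy inputs, sensing is worthwhile when the belief is uninformative (`uncertain`) and is skipped when the belief is already close to certainty (`confident`), which matches the intuition that sensing resources are used only when the available information is scarce.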

B. Sensing cost 

To account for the cost of sensing a given channel, the additive and constant cost $\xi_k$ was
introduced. So far, we considered that the value of $\xi_k$ was pre-specified by the system. However, the
value of $\xi_k$ can be tuned to represent physical properties of the CR. Some examples follow. Example
1: Suppose that to sense channel $k$, the NC spends a power $P_k^{NC}$. In this case, $\xi_k$ can be set to
$\xi_k = \pi_{NC} P_k^{NC}$, where $\pi_{NC}$ stands for the Lagrange multiplier associated with a long-term power
constraint on the NC. Example 2: Consider a setup for which the long-term rate of sensing is limited.
Mathematically, this can be accomplished by imposing that
$\mathbb{E}\left[\lim_{N\to\infty} \sum_{n=0}^{N} (1-\gamma)\gamma^{n} s_k[n]\right] \le \eta$,
where $\eta$ represents the maximum sensing rate (say 10%). Let $\rho_k$ be the Lagrange multiplier associated
with such a constraint; in this scenario $\xi_k$ should be set to $\xi_k = \rho_k$. Example 3: Suppose that if the
NC senses a channel, one fraction of the slot (say 25%) is lost. In this scenario $\xi_k[n] = 0.25 L_k[n]$
(time-varying opportunity cost). Linear combinations and stochastic versions of any of those costs
are possible too. Similarly, if a collaborative sensing scheme is assumed, aggregation of costs across
users can also be considered.
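
As a small illustration of how the three example costs could be combined linearly, consider the following sketch. The function name `sensing_cost`, its argument names, and the unit weights of the combination are assumptions made for illustration only.

```python
def sensing_cost(pi_nc=0.0, p_nc=0.0, rho_k=0.0, lost_fraction=0.0, iri_k=0.0):
    """Linear combination of the three example costs (assumed unit weights)."""
    power_cost = pi_nc * p_nc                 # Example 1: priced sensing power
    rate_cost = rho_k                         # Example 2: price of the sensing-rate limit
    opportunity_cost = lost_fraction * iri_k  # Example 3: fraction of the slot lost
    return power_cost + rate_cost + opportunity_cost

# Hypothetical values: each example alone, then all three combined.
cost_power_only = sensing_cost(pi_nc=0.5, p_nc=2.0)
cost_slot_only = sensing_cost(lost_fraction=0.25, iri_k=4.0)
cost_combined = sensing_cost(pi_nc=0.5, p_nc=2.0, rho_k=0.3,
                             lost_fraction=0.25, iri_k=4.0)
```

Only the third term varies from slot to slot, since it scales with the instantaneous value passed as `iri_k`; the first two are constant once the multipliers are fixed.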

C. Computing the multipliers and the value functions 

In this paper, both the objective to be optimized and the QoS requirements were formulated
as long-term (infinite-horizon) metrics [cf. (8a) and its constraints]. As a result, the value function
associated with the objective in (8a) and the Lagrange multipliers associated with the long-term
constraints are stationary (time invariant). Obtaining $V_k(\cdot)$, $\pi_m$ and $\theta_k$ is much easier than obtaining
their counterparts for short-term (finite-horizon) formulations. In fact, the optimum values of $V_k(\cdot)$, $\pi_m$
and $\theta_k$ for the short-term formulations would vary with time, so that at every time instant a
numerical search would have to be implemented. In contrast, for the long-term formulation, the
numerical search has to be implemented only once. Such a search can be performed with iterative
methods which are known to converge. At each iteration those methods perform an average over
the random (channel) state variables, which is typically implemented using a Monte Carlo approach.
Such a procedure may be challenging not only from a computational perspective, but also because
there may be cases where the statistics of the random processes are not known or are not
stationary. For all those reasons, low-complexity stochastic estimations of $V_k(\cdot)$, $\pi_m$ and $\theta_k$ are
also of interest. Regarding the Lagrange multipliers, dual stochastic subgradient methods can be
used as a low-complexity alternative with guaranteed performance (see, e.g., [24], [11] and references
therein for examples in the field of resource allocation in wireless networks). The development of
stochastic schemes to estimate $V_k(\cdot)$ is more challenging because the problem falls into the
category of functional estimation. Methods such as Q-learning [21, Ch. 8] or existing alternatives in
the reinforcement learning literature can be considered for the problem at hand. Although the design
and analytical characterization of stochastic implementations of the schemes derived in this paper
are of interest, they are out of the scope of the manuscript and left as future work.
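
To make the idea of a dual stochastic subgradient iteration concrete, the sketch below updates a single power multiplier online, using only the power consumed in each slot. It is a generic illustration and not a scheme derived in this paper: the per-slot power rule (`power_given_price`, a water-filling response to the current price), the exponentially distributed channel gain, the constant step size, and all names are assumptions.

```python
import random

def update_multiplier(pi_m, power_used, power_budget, step):
    """Projected stochastic subgradient step for a long-term power constraint."""
    return max(0.0, pi_m + step * (power_used - power_budget))

def power_given_price(h, pi_m, p_max=10.0):
    """Assumed per-slot response: maximize ln(1 + h p) - pi_m p over [0, p_max]."""
    if h <= 0.0:
        return 0.0
    if pi_m <= 0.0:
        return p_max
    return min(p_max, max(0.0, 1.0 / pi_m - 1.0 / h))

def run(num_slots=20000, power_budget=1.0, step=0.01, seed=0):
    """Run the online iteration and return (final multiplier, average power)."""
    rng = random.Random(seed)
    pi_m, total_power = 0.0, 0.0
    for _ in range(num_slots):
        h = rng.expovariate(1.0)  # random channel gain (assumed distribution)
        p = power_given_price(h, pi_m)
        total_power += p
        pi_m = update_multiplier(pi_m, p, power_budget, step)
    return pi_m, total_power / num_slots

pi_final, avg_power = run()
```

The multiplier rises while the consumed power exceeds the budget and falls otherwise, so the time-averaged power is driven towards the long-term budget without knowledge of the channel statistics.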
D. Extending the results to other CR setups

There are multiple meaningful ways to extend our results. One of them is to consider more
complex models for the CSI. Imperfect SISN can be easily accommodated into our formulation.
Non-Markovian models for the PU activity can be used too. The main problem here is to rely
on models that give rise to efficient ways to update the belief, e.g., by using recursive Bayesian
estimation; see [17] and references therein for further discussion on this issue. Finally, additional
sources of correlation (correlation across time for the SISN and correlation across channels for
the SIPN) can be considered too, rendering the POMDP more challenging to solve. Another line
of work is to address the optimal design for CR layouts different from the one in this paper. An
overlay CR was considered here, but underlay CR networks are of interest too. In such a case,
information about the channel gains between the SUs and PUs would be required. Similarly, in
this paper we limit the interference to the PU by bounding the average probability of interference.
Formulations limiting the average interfering power or the average rate loss due to the interfering
power are other reasonable options. Last but not least, developing distributed implementations for
our novel schemes is also a relevant line of work. Distributed solutions should address the problem
of cooperative sensing as well as the problem of distributed RA. Distributed schemes should be
able to cope with noise and delay in the (state) information the nodes exchange, so that a key
preliminary step for developing distributed schemes is the design of stochastic versions of the
sensing and RA policies. For some of these extensions, designs based on suboptimal but
low-complexity solutions may be an alternative worth exploring.